What is parse-latin?
The parse-latin npm package is a JavaScript library used to parse Latin-script natural language into a syntax tree. It is particularly useful for text processing tasks such as tokenization, sentence splitting, and word segmentation.
What are parse-latin's main functionalities?
Tokenization
This feature allows you to tokenize a given text into individual tokens (words, punctuation, etc.). The code sample demonstrates how to tokenize a simple sentence.
const ParseLatin = require('parse-latin');
const parser = new ParseLatin();
const tokens = parser.tokenize('This is a sentence.');
console.log(tokens);
Sentence Splitting
This feature enables you to split a paragraph into individual sentences. The code sample shows how to split a paragraph into separate sentences.
const ParseLatin = require('parse-latin');
const parser = new ParseLatin();
const sentences = parser.tokenizeParagraph('This is a sentence. This is another sentence.');
console.log(sentences);
Word Segmentation
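Words appear as WordNode entries in the syntax tree returned by parse. The code sample collects them from a parsed sentence.
const ParseLatin = require('parse-latin');
const parser = new ParseLatin();
// The API section below documents only tokenize() and parse(), so walk the parse() tree.
const tree = parser.parse('This is a sentence.');
// Collect every WordNode, descending into RootNode > ParagraphNode > SentenceNode.
const collect = (node) => node.type === 'WordNode' ? [node] : (node.children || []).flatMap(collect);
const words = collect(tree);
console.log(words);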
Other packages similar to parse-latin
compromise
Compromise is a natural language processing library for JavaScript that provides a wide range of text processing functionalities, including tokenization, part-of-speech tagging, and named entity recognition. Compared to parse-latin, Compromise offers more advanced NLP features and is more versatile.
natural
Natural is a general natural language processing library for JavaScript. It includes functionalities such as tokenization, stemming, classification, and phonetics. Natural is more feature-rich compared to parse-latin and is suitable for a wide range of NLP tasks.
parse-latin
A Latin script language parser producing NLCST nodes.
- For semantics of nodes, see NLCST;
- For a pluggable system to analyze and manipulate language, see retext.
Whether Old-English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum penninge”), Icelandic (“Hvað er að frétta”), or French (“Où sont les toilettes?”), parse-latin does a good job at tokenizing it.
Note also that parse-latin does a decent job at tokenizing Latin-like scripts, such as Cyrillic (“Добро пожаловать!”), Georgian (“როგორა ხარ?”), and Armenian (“Շատ հաճելի է”).
Installation
npm:
$ npm install parse-latin
Component:
$ component install wooorm/parse-latin
Bower:
$ bower install parse-latin
Usage
var ParseLatin = require('parse-latin'),
    latin = new ParseLatin();
latin.parse('A simple sentence.');
latin.parse(
    'The \xC5 symbol invented by A. J. A\u030Angstro\u0308m ' +
    '(1814, Lo\u0308gdo\u0308, \u2013 1874) denotes the ' +
    'length 10\u207B\xB9\u2070 m.'
);
API
ParseLatin()
Exposes the functionality needed to tokenize natural Latin-script languages into a syntax tree.
ParseLatin#tokenize(value)
Tokenize natural Latin-script language into letters and numbers (words), white space, and everything else (punctuation).
ParseLatin#parse(value)
Tokenize natural Latin-script languages into an NLCST syntax tree.
var ParseLatin = require('parse-latin'),
    latin = new ParseLatin();
latin.parse('A simple sentence.');
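As a rough illustration of working with the result of parse, the sketch below walks an NLCST-shaped tree and collects the text of its words. The tree literal is hand-written and abbreviated: it is an assumption about what parsing 'A simple sentence.' yields, and the exact shape may differ between versions. textOf and collectWords are hypothetical helpers, not part of parse-latin.

```javascript
// Hand-written, abbreviated NLCST-style tree (assumed shape, for illustration only).
const tree = {
  type: 'RootNode',
  children: [{
    type: 'ParagraphNode',
    children: [{
      type: 'SentenceNode',
      children: [
        { type: 'WordNode', children: [{ type: 'TextNode', value: 'A' }] },
        { type: 'WhiteSpaceNode', value: ' ' },
        { type: 'WordNode', children: [{ type: 'TextNode', value: 'simple' }] },
        { type: 'WhiteSpaceNode', value: ' ' },
        { type: 'WordNode', children: [{ type: 'TextNode', value: 'sentence' }] },
        { type: 'PunctuationNode', value: '.' }
      ]
    }]
  }]
};

// Hypothetical helper: concatenate the values of all leaf nodes under `node`.
function textOf(node) {
  return node.children ? node.children.map(textOf).join('') : node.value || '';
}

// Hypothetical helper: collect the text of every WordNode in the tree.
function collectWords(node) {
  if (node.type === 'WordNode') return [textOf(node)];
  return (node.children || []).flatMap(collectWords);
}

console.log(collectWords(tree));
console.log(textOf(tree));
```

In practice the tree would come from latin.parse(...) rather than a literal. The same recursive walk applies to any node type by changing the type string.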
Syntax Tree Format
Note: The easiest way to see how parse-latin tokenizes and parses is to use the online parser demo, which shows the syntax tree corresponding to the typed text.
Basically, parse-latin splits text into white space, word, and punctuation tokens. parse-latin starts out with a pretty easy definition, one that most other tokenizers use:
- A “word” is one or more letter or number characters;
- A “white space” is one or more white space characters;
- A “punctuation” is one or more of anything else.
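The three definitions above can be sketched with a single regular expression. This is a minimal illustration, not parse-latin's actual implementation: naiveTokenize is a hypothetical helper, and treating combining marks (\p{M}) as word characters is an assumption made so that decomposed letters such as A\u030A stay inside a word.

```javascript
// Hypothetical helper, not part of parse-latin: a naive three-way tokenizer.
function naiveTokenize(text) {
  // Group 1: word (letters, numbers, combining marks).
  // Group 2: white space.
  // Group 3: punctuation (anything else).
  const pattern = /([\p{L}\p{N}\p{M}]+)|(\s+)|([^\p{L}\p{N}\p{M}\s]+)/gu;
  const tokens = [];

  for (const match of text.matchAll(pattern)) {
    let type = 'punctuation';
    if (match[1] !== undefined) type = 'word';
    else if (match[2] !== undefined) type = 'whitespace';
    tokens.push({ type, value: match[0] });
  }

  return tokens;
}

console.log(naiveTokenize('A simple sentence.'));
```

Every character falls into exactly one of the three groups, so joining the token values reproduces the input text.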
Then, it manipulates and merges those tokens into an NLCST syntax tree, adding sentences and paragraphs where needed.
- Some punctuation marks are part of the word they occur in, e.g., “non-profit”, “she's”, “G.I.”, “11:00”, “N/A”, “&c”, “nineteenth- and...”;
- Some full-stops do not mark a sentence end, e.g., “1.”, “e.g.”, “id.”;
- Although full-stops, question marks, and exclamation marks (sometimes) end a sentence, that end might not occur directly after the mark, e.g., “.)”, “."”;
- And many more exceptions.
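To give a feel for the kind of merging described above, here is a minimal sketch of one such rule: joining a word, a single inner punctuation mark, and the following word into one word token. The token shape ({ type, value }), the INNER set, and mergeInnerWordPunctuation are all assumptions made for illustration. parse-latin's real rules are considerably more involved and are not reproduced here.

```javascript
// Marks that this sketch treats as word-internal when surrounded by words (assumed set).
const INNER = new Set(['-', "'", '\u2019', '/', ':']);

// Hypothetical helper: merge word + inner mark + word into a single word token.
function mergeInnerWordPunctuation(tokens) {
  const merged = [];
  let i = 0;

  while (i < tokens.length) {
    const token = tokens[i];
    const last = merged[merged.length - 1];
    const next = tokens[i + 1];

    if (
      token.type === 'punctuation' && INNER.has(token.value) &&
      last && last.type === 'word' &&
      next && next.type === 'word'
    ) {
      // Fold the mark and the following word into the previous word.
      last.value += token.value + next.value;
      i += 2;
    } else {
      // Copy the token so the input array is left untouched.
      merged.push({ ...token });
      i += 1;
    }
  }

  return merged;
}

console.log(mergeInnerWordPunctuation([
  { type: 'word', value: 'non' },
  { type: 'punctuation', value: '-' },
  { type: 'word', value: 'profit' }
]));
```

Cases such as “G.I.” or “nineteenth- and...” need extra rules (trailing marks, marks followed by white space) that this sketch deliberately leaves out.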
Benchmark
On a MacBook Air, parse-latin parses 2 large books, 25 big articles, or 2,056 paragraphs per second.
To put things into perspective, Shakespeare’s works contain 884,647 words. I have not tested it, but in theory parse-latin should parse these works in (slightly above) four seconds.
latin.parse(document);
2,056 op/s » A paragraph (5 sentences, 100 words)
267 op/s » A section (10 paragraphs)
25 op/s » An article (10 sections)
2 op/s » A (large) book (10 articles)
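The four-second estimate for Shakespeare's works follows from the paragraph figure in the listing above. The arithmetic below is a back-of-the-envelope sketch that assumes throughput scales linearly with text length, which the benchmark itself does not verify.

```javascript
// Figures taken from the benchmark listing above.
const paragraphsPerSecond = 2056;
const wordsPerParagraph = 100;
const shakespeareWords = 884647;

// Words processed per second at the paragraph rate.
const wordsPerSecond = paragraphsPerSecond * wordsPerParagraph;

// Estimated time for the complete works, assuming linear scaling.
const estimatedSeconds = shakespeareWords / wordsPerSecond;

console.log(wordsPerSecond); // 205600
console.log(estimatedSeconds.toFixed(1)); // "4.3"
```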
Related
License
MIT © Titus Wormer